ABSTRACT


This paper presents a winning solution for the CCKS-2020 financial event extraction task, where the goal 
is to identify event types, triggers and arguments in sentences across multiple event types. In this task, we 
focus on resolving two challenging problems (i.e., low resources and element overlapping) by proposing a 
joint learning framework, named SaltyFishes. We first formulate the event extraction task as a joint probability 
model. By sharing parameters in the model across different types, we can learn to adapt to low-resource 
events based on high-resource events. We further address the element overlapping problems by a mechanism 
of Conditional Layer Normalization, achieving even better extraction accuracy. The overall approach achieves
an F1-score of 87.8%, ranking first in the competition.


1. INTRODUCTION 


The CCKS-2020 financial event extraction task aims at extracting structural events by identifying event
types, triggers and arguments in sentences across multiple types. Figure 1 gives an example of event
extraction for a financial news sentence. One structural event belongs to the type investment, along
with the trigger acquire and its arguments, which provide complementary details. Note that this
sentence contains more than one event, and the triggers and arguments overlap across the events.


The CCKS-2020 task provides two kinds of such event sentences. The first one contains five types of 
events associated with an abundant sentence corpus, called source events. The second one contains another 
five types of events associated with low-resource sentence corpus, called target events. Each type of event 
sentence is split into training (labeled data) and testing (unlabeled data) parts. The goal is to extract
events from the test set of target events, on which performance is evaluated. This poses two main challenges
compared with traditional event extraction tasks [1,2,3,4,5]:


• The target events contain only 179 training sentences on average for each type. This limited supervision
  cannot provide sufficient contextual information for event extraction.
• Elements can overlap with each other, i.e., the same trigger or argument may belong to different
  events. As shown in the example, the trigger acquire and the argument Shijihuatong
  belong to both the investment and the share transfer event types. Performing event
  extraction with a simple sequence labeling method will cause label conflicts.
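

The label conflict can be illustrated with a small sketch. The tokens and role names follow the example in Figure 1; the one-tag-per-token view is only an assumed stand-in for a generic sequence labeling scheme, not the competition's annotation format.

```python
# (token, event type, role) triples taken from the example in Figure 1.
annotations = [
    ("Shijihuatong", "investment", "Sub-company"),
    ("Shengyue Network", "investment", "Obj-company"),
    ("Shengyue Network", "share transfer", "Sub-company"),
    ("Shijihuatong", "share transfer", "Obj-company"),
]


def find_label_conflicts(triples):
    """Return tokens that would need more than one tag in a single tagging pass."""
    roles_per_token = {}
    for token, _event_type, role in triples:
        roles_per_token.setdefault(token, set()).add(role)
    return {tok: sorted(roles) for tok, roles in roles_per_token.items() if len(roles) > 1}


conflicts = find_label_conflicts(annotations)
for token, roles in conflicts.items():
    print(token, "needs tags", roles)
```

Both company names require two different role tags, which a single tag sequence cannot represent.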


# Sentence
Shijihuatong/ set a price of/ 29.803 billion yuan/ to acquire/ Shengyue Network's/ 100% equity.


# Event 1

Event Type: investment

Trigger: acquire

Arguments
Sub-company: Shijihuatong
Obj-company: Shengyue Network
Money: 29.803 billion yuan


# Event 2

Event Type: share transfer

Trigger: acquire

Arguments
Sub-company: Shengyue Network
Obj-company: Shijihuatong
Money: 29.803 billion yuan
Proportion: 100%
Collateral: equity


Figure 1. Example of event extraction with element overlapping problem.


To address these challenges, we devised a joint learning method. The overall framework
is formulated as a joint probability model whose joint distribution is decomposed into a product of
three conditional distributions. Each conditional distribution corresponds to one subtask:
event type detection, trigger extraction and argument extraction.


For the first subtask, given a financial news sentence, we classify the sentence into its event
types with a multi-class multi-label text classification paradigm. For the other two subtasks, we successively
extract triggers and arguments with a pre-training/fine-tuning framework. The pre-training module is
implemented by further pre-training the language model BERT [6] on all financial news sentences, and we then
fine-tune the pre-trained model with respect to the trigger/argument extraction modules. To deal with the
overlapping element issue, we introduce conditional layer normalization, extracting triggers only according
to a specific event type, and extracting arguments only according to a specific trigger. This allows
elements to be extracted separately under each condition, which avoids label conflicts between overlapping
elements. In addition, by sharing parameters across different types in such a unified model, we can learn
to adapt to low-resource events based on high-resource events. Our approach achieved an F1-score of 87.8%,
which ranked first in the CCKS-2020 financial event extraction competition.


2. RELATED WORK 


Traditional event extraction research is usually conducted in the high-resource setting and assumes that
events appear in sentences without overlapped elements. These studies can be roughly categorized into the
following two groups:


1) Traditional joint methods [4,5,7,8,9] that perform trigger extraction and argument extraction 
simultaneously. They solve the task in a sequence labeling manner, and extract triggers or arguments 
by tagging the sentence only once. However, these methods fail in extracting overlapped elements 
since the overlapped elements will cause label conflicts when forced to have more than one label. 

2) Pipeline methods [1,10,11,12,13,14] that perform trigger extraction and argument extraction in 
separate stages. Though the pipeline methods enable extracting overlapped elements in separate 
stages [14], they usually lack explicit dependencies between the triggers and arguments and suffer 
from error propagation. All the above methods require sufficient training data to learn model 
parameters for each event type, and few methods can extract complex overlapped elements in event 
extraction. 


Recently, several methods have been proposed to solve event extraction in various low-resource
settings, such as the few-shot learning setting [2], the zero-shot learning setting [3,12] and the incremental
learning setting [15]. However, these methods cannot be directly transferred to the CCKS-2020 financial event
extraction task, because this task aims at extracting low-resource target events with the help of
rich-resource source events, which is a different setting from the above low-resource
event extraction settings.


3. OUR APPROACH 


We present the overview, the design of each component, and some strategies for improvements.


3.1 Overview


Given a sentence denoted as s, we propose a joint learning approach to identify its event types C,
event triggers T and event arguments A. The approach is formulated as a joint probability model, which is
decomposed into three submodels with respect to the event type detection, the event trigger extraction and
the event argument extraction:


$$P(C, T, A \mid s; \Theta) = P(C \mid s; \Theta_1) \, P(T \mid s, C; \Theta_2, \Theta_3) \, P(A \mid s, C, T; \Theta_2, \Theta_4)$$


The event type detection is modeled by a multi-class multi-label text classification paradigm, where $\Theta_1$
is the set of type detection model parameters. The other two extraction parts are modeled by a pre-training/
fine-tuning framework, where $\Theta_2$ contains model parameters shared by both modules, while $\Theta_3$ and $\Theta_4$ are
their respective private model parameters. All parameters in $\Theta \triangleq \{\Theta_1, \Theta_2, \Theta_3, \Theta_4\}$ are shared across
different event types (either high-resource or low-resource), which promotes rich interactions between source
events and target events. Figure 2 sketches the overall framework.


Figure 2. The overall framework of the financial event extraction approach.


3.2 Event Detection Model 


In order to discover the event types occurring in the sentence, we adopted the code provided by the official
competition (https://github.com/xyionwu/ccks-2020-finance-transfer-ee-baseline) as our event detection
model (EDM). This model utilizes a pre-trained language model (PLM) to derive sentence representations
and is formulated as a multi-label multi-class text classification task. Specifically,
given the sentence s, the probability of s belonging to a specific type $c_m$ is calculated as in Equation (1):


$$p(c_m \mid s; \Theta_1) = \mathrm{sigmoid}(w_m^{\top} z_{CLS}) \tag{1}$$


where $z_{CLS}$ is the hidden state corresponding to the input token <CLS> in the PLM, which encodes the
entire sentence representation of s; $\Theta_1$ includes all parameters used in the PLM. For simplicity, we denote
$p(c_m \mid s; \Theta_1)$ as $p_m$.


Then, we can update the model and obtain the desired sentence representation $z_{CLS}$ by minimizing the
following binary cross entropy loss function in Equation (2):


$$\mathrm{Loss}(\Theta_1) = -\sum_{n=1}^{N} \sum_{m=1}^{M} \left[ y_{nm} \log(p_{nm}) + (1 - y_{nm}) \log(1 - p_{nm}) \right] \tag{2}$$


where N is the number of training sentences; M is the number of pre-defined event types; $y_{nm}$ is
the true type label, which is either 0 or 1; and $p_{nm}$ is $p_m$ for the n-th sentence. During prediction, we
simply set a threshold $\delta$ and selected as the resultant event types C every type $c_m$ such that
$p(c_m \mid s; \Theta_1) > \delta$.
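

The scoring, loss and thresholding steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the competition baseline: the 3-dimensional <CLS> vector, the weight vectors and the threshold value 0.5 are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def type_probabilities(z_cls, type_weights):
    """p_m = sigmoid(w_m . z_cls) for every event type m, as in Equation (1)."""
    return [sigmoid(sum(w * z for w, z in zip(w_m, z_cls))) for w_m in type_weights]


def bce_loss(probs, labels):
    """Binary cross entropy summed over event types for one sentence, as in Equation (2)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(probs, labels))


def predict_types(probs, type_names, threshold=0.5):
    """Keep every type whose probability exceeds the threshold (assumed value)."""
    return [name for name, p in zip(type_names, probs) if p > threshold]


# Toy values: a hypothetical <CLS> vector and one weight vector per event type.
z_cls = [1.0, -2.0, 0.5]
type_weights = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = type_probabilities(z_cls, type_weights)
predicted = predict_types(probs, ["investment", "share transfer"])
loss = bce_loss(probs, [1, 0])
```

Because each type is scored with its own sigmoid, a sentence can be assigned zero, one or several event types.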


3.3 Event Extraction Model 


This section introduces our event extraction model (EEM), which handles two subtasks, trigger extraction
and argument extraction, with a pre-training/fine-tuning framework. The pre-training part encodes sentence tokens
as contextualized representations with the pre-trained language model BERT [6], which contains rich
language knowledge and is widely used for natural language processing (NLP) tasks. The fine-tuning part is divided
into three modules: a shared module that encodes condition information based on conditional layer
normalization, and two private modules that extract triggers and arguments. Note that both extraction modules
have a similar model structure.


3.3.1 Shared Module 


This section introduces a sentence representation layer shared by both trigger extraction and argument
extraction, which derives a conditional sentence representation $H_{\text{s-typ}}$ for the specific event type c, and
a syntactic feature representation $H_{syn}$.


Since we have obtained the event types C occurring in the given sentence s, we derive sentence
representations conditioned on each specific event type c ∈ C, so as to avoid element overlapping issues.
To this end, we introduced a general module, named conditional layer normalization (CLN) [16,17], to
integrate such conditional information into the sentence representation. CLN is mostly based on the well-known
layer normalization [18], but dynamically generates the gain $\gamma$ and bias $\beta$ based on the condition
information. Given a condition representation c and a sentence representation x, CLN is formulated in
Equations (3) to (5):


$$\mathrm{CLN}(x, c) = \gamma_c \odot \left( \frac{x - \mu}{\sigma} \right) + \beta_c \tag{3}$$

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad \sigma = \sqrt{\frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2} \tag{4}$$

$$\gamma_c = W_{\gamma} c + b_{\gamma}, \qquad \beta_c = W_{\beta} c + b_{\beta} \tag{5}$$


where $x_i$ is the i-th dimension element in x, d is the dimension of x, and $\gamma_c \in \mathbb{R}^d$ and $\beta_c \in \mathbb{R}^d$ are the
conditional gain and bias, respectively. In this way, the given condition representation is encoded into the
gain and bias, and then integrated into the contextual representations.
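

A minimal sketch of CLN for a single token vector, written with plain Python lists, may make Equations (3) to (5) more concrete. The small `eps` term for numerical stability and all toy weights are assumptions introduced for illustration; they are not taken from the paper.

```python
import math


def affine(matrix, bias, vec):
    """Return matrix @ vec + bias for plain Python lists."""
    return [sum(m * v for m, v in zip(row, vec)) + b for row, b in zip(matrix, bias)]


def conditional_layer_norm(x, c, w_gamma, b_gamma, w_beta, b_beta, eps=1e-12):
    """Normalize x, then scale and shift it with a gain and bias generated from c."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d + eps)  # eps: assumed stabilizer
    gamma = affine(w_gamma, b_gamma, c)  # conditional gain, Equation (5)
    beta = affine(w_beta, b_beta, c)     # conditional bias, Equation (5)
    return [g * (xi - mu) / sigma + b for g, xi, b in zip(gamma, x, beta)]


# Toy example: a 2-dimensional token vector and a 1-dimensional condition vector.
x = [1.0, 3.0]
c = [2.0]
out = conditional_layer_norm(
    x, c,
    w_gamma=[[0.0], [0.0]], b_gamma=[1.0, 1.0],  # gain fixed to 1 for readability
    w_beta=[[0.5], [0.5]], b_beta=[0.0, 0.0],    # bias = 0.5 * c = 1.0
)
```

With these toy weights the normalized vector [-1, 1] is shifted by the condition-dependent bias, so a different condition vector yields a different output for the same token.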


Then we utilized CLN to integrate event type information into the sentence. Specifically, we first
transformed the event type's name into textual tokens; for example, the Chinese name of the type investment
was split into its two character tokens. Then we concatenated these type tokens with the word tokens
in the sentence s, forming a sequence X: <CLS> + type tokens + <SEP> + word tokens + <SEP>.
The sequence was input into the PLM to derive its contextualized representations, and we termed the
representations corresponding to the type tokens $H_c$ and those corresponding to the word tokens $H_s$. Then, we
fused the mean-pooled $H_c$ with $H_s$ to derive the conditional token representations for s with Equation (6):


$$H_{\text{s-typ}} = \mathrm{CLN}(H_s, \mathrm{MeanPooling}(H_c)) \tag{6}$$


where $H_{\text{s-typ}}$ is the token representations conditioned on the event type c. This process generates type-aware
token representations adaptive to various event types. As such, we can perform trigger extraction and
argument extraction in the independent context of each type.
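

The construction of the conditional input can be sketched as follows. The token strings and the 2-dimensional hidden states are hypothetical stand-ins for the tokenizer and PLM output; only the sequence layout and the mean pooling follow the description above.

```python
def build_conditional_input(type_tokens, word_tokens):
    """Return the token sequence and the index ranges of type and word tokens."""
    sequence = ["<CLS>"] + type_tokens + ["<SEP>"] + word_tokens + ["<SEP>"]
    type_span = range(1, 1 + len(type_tokens))
    word_start = 2 + len(type_tokens)
    word_span = range(word_start, word_start + len(word_tokens))
    return sequence, type_span, word_span


def mean_pooling(vectors):
    """Average a list of equal-length vectors into one condition vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical tokens; `hidden` stands in for the PLM's contextualized output.
sequence, type_span, word_span = build_conditional_input(
    ["type_a", "type_b"], ["w1", "w2", "w3"]
)
hidden = [[float(i), float(2 * i)] for i in range(len(sequence))]
h_c = [hidden[i] for i in type_span]   # representations of the type tokens
h_s = [hidden[i] for i in word_span]   # representations of the word tokens
condition = mean_pooling(h_c)          # condition vector passed to CLN
```

Each vector in `h_s` would then be normalized by CLN with `condition` as the condition representation.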


3.3.2 Private Module 
The private module contains the following two sub-modules. 
(1) Trigger Extraction Module (TEM) 
This module extracts event triggers given the event type c ∈ C. In order to improve the textual representations
for trigger extraction, we adopted a self-attention (SA) layer. The type-aware token representations are
enhanced as in Equations (7) and (8):


$$H_{\text{sa-typ}} = \mathrm{SA}(H_{\text{s-typ}}) \tag{7}$$

$$H_{tri} = H_{\text{s-typ}} \oplus H_{\text{sa-typ}} \oplus H_{syn} \tag{8}$$


where $\oplus$ is the concatenation operation. $H_{syn}$ corresponds to the representation of syntactic features,
obtained with the NLP tool LTP (http://ltp.ai/). The syntactic features include B/I/O labels of word
segmentation, part-of-speech tagging, named entity recognition and dependency parsing, which are initialized
randomly as learnable embeddings in our model.
In order to strengthen interactions among triggers of different event types, we predicted triggers with the
same trigger extractor. For each token, we adopted a pair of fully connected networks (FCN) to predict
whether it is a “begin” or “end” position of a trigger, as summarized in Equations (9) and (10):


$$p(t_b \mid x_i, c; \Theta_2, \Theta_3) = \mathrm{sigmoid}(w_{t_b}^{\top} h_{tri_i}) \tag{9}$$

$$p(t_e \mid x_i, c; \Theta_2, \Theta_3) = \mathrm{sigmoid}(w_{t_e}^{\top} h_{tri_i}) \tag{10}$$


where $h_{tri_i}$ is the i-th element of $H_{tri}$, $w_{t_b}$ and $w_{t_e}$ are learnable parameters, and $\Theta_3$ includes
$w_{t_b}$, $w_{t_e}$ and the parameters in SA.


Then, a binary cross entropy loss function was used for begin position prediction and end position
prediction, denoted as $\mathrm{Loss}_{t_b}$ and $\mathrm{Loss}_{t_e}$. The final loss is defined as in Equation (11):


$$\mathrm{Loss}_{tri}(\Theta_2, \Theta_3) = w_1 \cdot \mathrm{Loss}_{t_b}(\Theta_2, \Theta_3) + (1 - w_1) \cdot \mathrm{Loss}_{t_e}(\Theta_2, \Theta_3) \tag{11}$$


where $w_1 \in (0,1)$ is a trade-off factor. For prediction, we simply set a threshold $\delta_{tri}$ and selected positions
whose prediction scores are higher than $\delta_{tri}$. We matched each begin position with the nearest end
position to obtain a complete trigger. The final trigger extraction results formed the trigger set T.
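

The decoding rule can be sketched as below. It assumes that "nearest" means the closest end position at or after the begin position and that a begin without any such end is discarded; the scores and the threshold 0.5 are hypothetical.

```python
def decode_spans(begin_scores, end_scores, threshold=0.5):
    """Pair each predicted begin position with the nearest end position at or after it."""
    begins = [i for i, s in enumerate(begin_scores) if s > threshold]
    ends = [i for i, s in enumerate(end_scores) if s > threshold]
    spans = []
    for b in begins:
        following = [e for e in ends if e >= b]
        if following:  # drop begins that have no matching end (assumption)
            spans.append((b, min(following)))
    return spans


# Hypothetical per-token scores for a six-token sentence.
begin_scores = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1]
end_scores = [0.1, 0.2, 0.7, 0.1, 0.9, 0.3]
spans = decode_spans(begin_scores, end_scores)
print(spans)
```

Here the first span covers tokens 1 to 2 and the second is the single token 4, whose begin and end scores both pass the threshold.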


(2) Argument Extraction Module (AEM) 

This module extracts arguments conditioned on one of the triggers in T extracted by the TEM. Given
a specific trigger t ∈ T in the sentence s, we obtained the trigger-aware sentence representation $H_{\text{s-tri}}$
conditioned on t, using the same CLN-based process as in Equation (6). We also utilized a self-attention layer to
enhance the sentence representation, termed $H_{\text{sa-tri}}$. To discern the position of trigger t, we further added
its relative position embedding R, which measures the distance from the current position to the trigger position.
The syntactic feature $H_{syn}$ was also taken into consideration. Thus, the enhanced overall sentence representation
is defined as in Equation (12):


$$H_{arg} = H_{\text{s-tri}} \oplus H_{\text{sa-tri}} \oplus R \oplus H_{syn} \tag{12}$$
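

One way to compute the indices behind the relative position embedding R is sketched below. The signed distance, the zero value inside the trigger span and the clipping range `MAX_DISTANCE` are all assumptions made for illustration; the paper only states that R measures the distance from the current position to the trigger position.

```python
MAX_DISTANCE = 10  # hypothetical clipping range, not a value from the paper


def relative_positions(num_tokens, trigger_start, trigger_end):
    """Signed distance of each token to the trigger span (0 inside the span)."""
    distances = []
    for i in range(num_tokens):
        if i < trigger_start:
            dist = i - trigger_start   # negative: token precedes the trigger
        elif i > trigger_end:
            dist = i - trigger_end     # positive: token follows the trigger
        else:
            dist = 0
        distances.append(max(-MAX_DISTANCE, min(MAX_DISTANCE, dist)))
    return distances


rel = relative_positions(num_tokens=6, trigger_start=3, trigger_end=3)
# Shift to non-negative ids so that they can index an embedding table.
embedding_ids = [d + MAX_DISTANCE for d in rel]
```

Each id would select one learnable embedding vector, and the resulting sequence of vectors plays the role of R in the concatenation.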


For argument extraction, we extracted all arguments with pairs of FCNs, formulated as in
Equations (13) and (14):


$$p(a_b^k \mid x_i, c, t; \Theta_2, \Theta_4) = \mathrm{sigmoid}(w_{a_b^k}^{\top} h_{arg_i}) \tag{13}$$

$$p(a_e^k \mid x_i, c, t; \Theta_2, \Theta_4) = \mathrm{sigmoid}(w_{a_e^k}^{\top} h_{arg_i}) \tag{14}$$


where $h_{arg_i}$ is the i-th element of $H_{arg}$, and $w_{a_b^k}$ and $w_{a_e^k}$ are learnable parameters for the k-th argument role.
$\Theta_4$ includes $w_{a_b^k}$, $w_{a_e^k}$, $W_r$, and all the parameters in the SA and CLN. Note that the number of pre-defined
argument roles is K, so there are 2K prediction sequences for all the argument roles.


The loss function is also binary cross entropy for both begin and end position prediction for each
argument role, and it is calculated with Equation (15):


$$\mathrm{Loss}_{arg}(\Theta_2, \Theta_4) = w_2 \cdot \mathrm{Loss}_{a_b}(\Theta_2, \Theta_4) + (1 - w_2) \cdot \mathrm{Loss}_{a_e}(\Theta_2, \Theta_4) \tag{15}$$


where $w_2 \in (0,1)$ is a trade-off factor. For prediction, we simply set a threshold $\delta_{arg}$ and selected positions
whose prediction scores are higher than $\delta_{arg}$. We matched each begin position with the nearest end position to
obtain a complete argument. We removed the redundant argument types for each event type based on the
event schema constraints. The final results formed the argument set A.


3.3.3 Training and Prediction 


To jointly learn the TEM and AEM, we combined the losses from the two modules as in Equation (16):


$$\mathrm{Loss}(\Theta_2, \Theta_3, \Theta_4) = w_3 \cdot \mathrm{Loss}_{tri}(\Theta_2, \Theta_3) + (1 - w_3) \cdot \mathrm{Loss}_{arg}(\Theta_2, \Theta_4) \tag{16}$$


where $w_3 \in (0,1)$ is a weight hyperparameter to balance the two modules.
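

As a worked example of how the three trade-off weights combine, the sketch below nests the weighted sums of Equations (11), (15) and (16). The default weights are the values reported in Section 4.2 (0.5, 0.5 and 0.2); the four individual loss values are hypothetical.

```python
def weighted_pair(weight, first, second):
    """Convex combination: weight * first + (1 - weight) * second."""
    return weight * first + (1.0 - weight) * second


def joint_loss(loss_tb, loss_te, loss_ab, loss_ae, w1=0.5, w2=0.5, w3=0.2):
    loss_tri = weighted_pair(w1, loss_tb, loss_te)  # trigger loss, Equation (11)
    loss_arg = weighted_pair(w2, loss_ab, loss_ae)  # argument loss, Equation (15)
    return weighted_pair(w3, loss_tri, loss_arg)    # joint loss, Equation (16)


# Hypothetical begin/end losses for the trigger and argument extractors.
total = joint_loss(loss_tb=0.4, loss_te=0.6, loss_ab=1.0, loss_ae=2.0)
```

With these numbers the trigger loss is 0.5 and the argument loss is 1.5, so the total is 0.2 * 0.5 + 0.8 * 1.5 = 1.3; under this reading a weight of 0.2 places most of the emphasis on argument extraction.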


We utilized ground-truth labels to train the overall model. For prediction, we first obtained trigger 
extraction results, and then input them into the argument extraction module. The results obtained from the 
two modules were returned as the final predictions. 
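

The cascaded prediction procedure can be sketched as a pair of nested loops. The three callables are hypothetical stubs standing in for the EDM, TEM and AEM; their outputs simply reproduce the example of Figure 1.

```python
def extract_events(sentence, detect_types, extract_triggers, extract_arguments):
    """Cascade: event types -> triggers per type -> arguments per (type, trigger)."""
    events = []
    for event_type in detect_types(sentence):
        for trigger in extract_triggers(sentence, event_type):
            arguments = extract_arguments(sentence, event_type, trigger)
            events.append({"type": event_type, "trigger": trigger, "arguments": arguments})
    return events


# Stub models (hypothetical) that return fixed predictions.
def detect_types(sentence):
    return ["investment", "share transfer"]


def extract_triggers(sentence, event_type):
    return ["acquire"]


def extract_arguments(sentence, event_type, trigger):
    if event_type == "investment":
        return {"Sub-company": "Shijihuatong", "Obj-company": "Shengyue Network"}
    return {"Sub-company": "Shengyue Network", "Obj-company": "Shijihuatong"}


sentence = "Shijihuatong set a price of 29.803 billion yuan to acquire Shengyue Network's 100% equity."
events = extract_events(sentence, detect_types, extract_triggers, extract_arguments)
```

Because arguments are extracted once per (type, trigger) pair, the same company can take different roles in the two events without any label conflict.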


3.4 Additional Improvement Strategy 


Besides the training data sets, the unlabeled data in the testing data sets also contain rich information.
In order to exploit all the data to improve the performance, we also employed the following strategies:


(1) Continuing Pre-training on PLM 

PLMs are usually pre-trained on the common corpus, which may cause semantic bias on the financial 
corpus. Therefore, we continued pre-training the PLM on all the financial data, including training data and 
testing data. This strategy was applied to both EDM and EEM. 


(2) Model Ensemble on Variant Data Splits 

To fully exploit labeled data, we adopted K-fold cross-validation on the labeled data, leading to K models
trained on different data splits. Then, we ensembled the predictions of the K models with a voting strategy.
This model ensemble strategy was applied to EDM and EEM separately.


(3) Utilizing Pseudo-Labels on Unlabeled Data 

To fully exploit unlabeled data, we employed a novel strategy to label testing data with pseudo labels. 
Specifically, we trained models on the ground-truth data, and then predicted labels on those unlabeled 
data, called pseudo-labeled data. By integrating the pseudo-labeled data into the ground-truth data, we 
obtained a mixed training data set. We trained new models on this mixed data set. Note that this strategy 
was only used for EEM, where we achieved better performance. 
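

Strategies (2) and (3) can be sketched together. The paper only states that a voting strategy was used, so the majority rule below is an assumption, and the lambda "models" are hypothetical stubs that return (role, text) predictions.

```python
from collections import Counter


def majority_vote(predictions_per_model, min_votes=None):
    """Keep items predicted by more than half of the models (or at least min_votes)."""
    if min_votes is None:
        min_votes = len(predictions_per_model) // 2 + 1
    counts = Counter(item for preds in predictions_per_model for item in set(preds))
    return sorted(item for item, n in counts.items() if n >= min_votes)


def build_mixed_training_set(labeled, unlabeled, fold_models):
    """Add ensemble pseudo-labels for unlabeled sentences to the ground-truth data."""
    pseudo = [(s, majority_vote([model(s) for model in fold_models])) for s in unlabeled]
    return labeled + pseudo


# Three stub models (hypothetical), e.g. trained on different data splits.
models = [
    lambda s: [("Money", "29.803 billion yuan"), ("Proportion", "100%")],
    lambda s: [("Money", "29.803 billion yuan")],
    lambda s: [("Money", "29.803 billion yuan"), ("Proportion", "100%")],
]
labeled = [("sentence A", [("Money", "1 million yuan")])]
mixed = build_mixed_training_set(labeled, ["sentence B"], models)
```

New models would then be trained on `mixed`, which contains both the ground-truth and the pseudo-labeled sentences.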


4. EXPERIMENT


This section introduces the data set provided in the competition, and conducts experiments to evaluate 
the model. 


4.1 Data Set 


The data set provided in the competition contains source event data and target event data, including
labeled data and testing data for each event type. The statistics of the labeled data for each event type are
shown in Table 1. The competition only evaluated on the testing target event data. For validation, we held out
a part of the labeled data as validation data. The details of the data partition are shown in Table 2. Since
the ground truth of the testing data was not available, all experiments below were conducted on the validation data.


Table 1. Statistics of each event type in the data set.


Source type    pledge         investment    share transfer    reduction        prosecution
Data size      815            1083          1581              670              533

Target type    acquisition    judgment      win bid           sign contract    guarantee
Data size      200            200           200               132              163


Table 2. Data partition for training, validation and testing. 


               Training    Validation    Testing
Source type    2,459       273           163,763
Target type    738         82            93,610


4.2 Implementation 


We utilized an extended BERT model as our PLM, which was pre-trained on a mixed large Chinese
corpus (https://share.weiyun.com/5G90sMJ), termed BERT-ext, and taken from the model zoo
(https://github.com/dbiir/UER-py/wiki/Modelzoo). Then we continued pre-training it on the financial data.
For EDM, we set the learning rate to 2e-5 and the batch size to 16. For EEM, we applied a learning rate of
2e-5 to the PLM layer and 1e-4 to the other layers, with a batch size of 8. The trade-off weights $w_1$, $w_2$ and $w_3$
were set to 0.5, 0.5 and 0.2, respectively. The embedding dimension of each kind of syntactic feature was set
to 40. The relative position embedding dimension was set to 64. We applied dropout to the SA layer and all input
embeddings with the rate set to 0.3. With the model ensemble strategy, we trained five EDMs for better
event type prediction. For EEM, we first trained five models and ensembled their results to obtain pseudo labels
on the testing data. Then, we trained 10 EEMs on the mixed training data and obtained 10 predictions on
the testing data. We ensembled all 15 EEM results as the final submission.


4.3 Main Result


Since the ground truth of the testing data was not available, we conducted experiments on the validation
data. The F1-scores of event detection, trigger extraction and argument extraction on the validation data were
0.921, 0.970, and 0.889, respectively.


The best result (F1-score) of our approach on the official testing data was 0.8781, which was the highest
score in the competition.


4.4 Ablation Study 


We conducted an ablation study on the event extraction model; the results are shown in Table 3.
As the entire decoding process is a pipelined paradigm, the performance of each submodel is affected by
the previously predicted results. To avoid this impact, we adopted ground-truth results as the input of each
submodel. Specifically, Line 1 in Table 3 shows the complete model, which was trained on both ground-truth
data and pseudo-label data with all components. Line 2 removed the pseudo-label data, and F1-mean
dropped from .930 to .888, showing that utilizing pseudo-label data improved performance substantially. The
following experiments were ablated based on Line 2. Line 3 removed source data in training, and the result
indicates that learning target events with source events was effective. Line 4 and Line 5 replaced the
continued pre-trained PLM with BERT and the standard BERT-ext, respectively, which suggests the effectiveness
of continuing pre-training for the PLM. Line 6 replaced CLN with a simple concatenation operation,
which indicates that CLN can utilize condition information more effectively. Line 7 applied the same
learning rate to all layers, which indicates that utilizing different learning rates on model layers benefits the
learning process. Line 8 removed syntactic features, which indicates that syntactic features improved
the extraction performance. All the results demonstrated the effectiveness of each component in the task.


Table 3. Results on validation data.


                             Trigger Extraction      Argument Extraction
                             P       R       F1      P       R       F1      F1-mean
1. Complete model            .969    .979    .970    .844    .969    .889    .930
2. w/o pseudo-label data     .940    .952    .939    .845    .863    .838    .888
3. w/o source data           .941    .945    .934    .818    .865    .823    .878
4. repl PLM: BERT            .901    .924    .904    .789    .825    .789    .846
5. repl PLM: BERT-ext        .931    .938    .929    .837    .886    .828    .879
6. repl CLN: concat          .931    .931    .926    .807    .860    .814    .870
7. w/o layer lr              .946    .945    .940    .799    .869    .816    .878
8. w/o syntactic feature     .921    .924    .917    .863    .874    .856    .887


4.5 Case Study 


To illustrate the predictions of the model, we conducted a case study.
Figure 3 depicts an example of the model results. Given an input sentence, the model sequentially conducted
event detection, trigger extraction and argument extraction. The model first predicted all event types
occurring in the sentence. Then, the model extracted triggers for each given event type.
Next, the model extracted all arguments according to the given event type and the given trigger. Such a
process extracted overlapped elements separately. Besides, by sharing parameters across different types in
such a unified model, the model learned to extract low-resource events based on the high-resource events.


# Sentence
Shijihuatong/ set a price of/ 29.803 billion yuan/ to acquire/ Shengyue Network's/ 100% equity.


# Event Detection Model
-> given: the sentence               prediction: investment
                                                 share transfer


# Trigger Extraction Module
-> given: type: investment           prediction: acquire
-> given: type: share transfer       prediction: acquire


# Argument Extraction Module
-> given:                            prediction:
   type: investment                  Sub-company: Shijihuatong
   trigger: acquire                  Obj-company: Shengyue Network
                                     Money: 29.803 billion yuan
-> given:                            prediction:
   type: share transfer              Sub-company: Shengyue Network
   trigger: acquire                  Obj-company: Shijihuatong
                                     Money: 29.803 billion yuan
                                     Proportion: 100%
                                     Collateral: -


Figure 3. An example of the model predictions.


Though the model addresses the low-resource issue and the element overlapping issue, two main error
patterns remain: 1) the error propagation problem: since the subtasks are conducted in
a cascading manner, errors in earlier predictions lead to errors in the following predictions;
2) argument prediction errors: since the arguments are associated with their roles in complicated ways, the
model tends to predict fewer arguments or to predict arguments with wrong roles. We will attempt further
improvements in the future.


5. CONCLUSION 


In this paper, we proposed a financial event extraction approach based on a joint learning framework,
which utilizes all the available data to improve the performance on low-resource event types and handles
the overlapping problem of event elements with conditional layer normalization. The approach achieved an
F1-score of 87.8% on the official testing data and ranked first in the CCKS-2020 financial event extraction
competition.